A Multivariate Correlation Ratio


by

Allan R. Sampson

Department  of  Mathematics  and  Statistics 
University  of  Pittsburgh 
Pittsburgh,  PA  15260 

1.  INTRODUCTION  AND  HISTORICAL  BACKGROUND. 

The correlation ratio, $\eta(Y;X)$, of a random variable $Y$ upon a random variable $X$, defined (with suitable assumptions) by

$$\eta(Y;X) = \{\mathrm{Var}(E(Y|X))/\mathrm{Var}\,Y\}^{1/2}, \tag{1.1}$$

was first introduced by K. Pearson (1903, p. 304), who wrote "$\eta$ is a useful constant which ought always to be given for non-linear systems... it measures the approach of the system not only to linearity but to single valued relationship, i.e., to a causal nexus". Pearson further discussed $\eta$ in his papers of (1905; or see (1948, pp. *77-528)) and (1909). In his 1905 paper, he wrote "the correlation ratio... is an excellent measure of the stringency of correlation always lying numerically between the values 0 and 1, which mark absolute independence and complete causation respectively". He further noted, based on his considerations of non-normal bivariate data, that "the ease with which $\eta$ can be calculated suggests that in many cases it should accompany, if not replace the determination of the correlation coefficient".

Blakeman (1905) also introduced a criterion based on $\eta$ to test for linearity of regression. Fisher (1925; pp. 257-260 of 14th Ed. (1970)), seemingly less enthusiastic about $\eta$, wrote concerning the sample analogue of $\eta$ in the regression model that "as a descriptive statistic the utility of the correlation ratio is extremely limited". It appears that many of his concerns were based on certain distributional properties. A more recent discussion concerning $\eta$ can be found in Lancaster (1969, pp. 201-202).

Various other properties of $\eta$ have been considered within the literature on measures of association and measures of dependence, most of which has been written within approximately the last 20 years. Kruskal (1958), in his survey on ordinal measures of association, discussed $\eta$, and Renyi (1959), in his axiomatic development of measures of dependence, examined properties of $\eta$. More recently, Hall (1970) defined the dependence characteristic function as $\eta(e^{itY};X)$, where a suitable extension of $\eta$ to complex-valued random variables was given; the relative merits of the dependence characteristic function versus the correlation ratio were considered. Kotz and Soong (1977) further reviewed some of the probabilistic properties of $\eta$. Hall (1970) also noted that when $X$ is vector valued, the correlation ratio $\eta(Y;X)$, defined by (1.1) now with $E(Y|X)$ for vector $X$ in the numerator, has essentially the same properties as when $X$ is a scalar random variable. Within a specific multivariate normal setting, Johnson and Kotz (1972, p. 186) noted that a certain multivariate beta random variable could be viewed as a multivariate generalization of $\eta$.

The correlation ratio is in some ways connected to the sup-correlation coefficient between random variables $X$ and $Y$, defined by

$$\rho'(X,Y) = \sup \rho(f(X),g(Y)), \tag{1.2}$$

where the supremum is over suitable functions $f$ and $g$, and $\rho$ is the Pearson correlation coefficient. This measure of dependence was introduced by Gebelein (1941) and developed further by Sarmanov (1958A), (1958B), Renyi (1959) and Lancaster (whose work is summarized in Lancaster (1969)). Sarmanov and Zaharov (1960) extended this concept to the multivariate case, defining $\rho'(X,Y) = \sup \rho(f(X),g(Y))$, where the supremum is over suitable $f$, $g$, which map $R^p$ and $R^q$, respectively, into $R^1$. We note that, except in very special cases, it is difficult to obtain an explicit evaluation of $\rho'(X,Y)$.

In this paper, we consider defining the correlation ratio for the case when both $Y$ and $X$ are vector random variables. This extension would, for example, accommodate the situation when we are studying the relationship of $X_{r+1},\ldots,X_{r+s}$ to $X_1,\ldots,X_r$, or when we are relating jointly a time series $Y_1,\ldots,Y_q$ to a time series $X_1,\ldots,X_p$. The properties of this multivariate correlation ratio are explored in light of the properties of $\eta(Y;X)$. We also examine maximizing the multivariate correlation ratio over certain linear combinations, and study the relationship of this concept to other multivariate notions, including the sup-correlation. A number of specific multivariate distributional examples are considered, including the normal, elliptically symmetric and Farlie-Morgenstern-Gumbel.

2. A REVIEW OF RESULTS PERTAINING TO $\eta(Y;X)$.

In this section, we survey some results concerning $\eta(Y;X)$ and discuss briefly some of the implications of these results.

Theorem 2.1. Let $Y$ be a random variable with $0 < \mathrm{Var}\,Y < \infty$. Let $X$ be a p-dimensional random vector, jointly distributed with $Y$.

(a) Then $\min E(Y-g(X))^2$ occurs at $g(x) = E(Y|X = x)$, where the minimum is over all measurable $g: R^p \to R^1$ for which $E(Y-g(X))^2 < \infty$.

(b) Furthermore, the minimum value is $(1-\eta^2(Y;X))\,\mathrm{Var}\,Y$.

A proof of essentially Theorem 2.1 can be found, for example, in Parzen (1960) or Hall (1970).
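As an illustration of Theorem 2.1, the following minimal sketch computes $\eta^2(Y;X)$ from (1.1) for a small discrete joint distribution and checks part (b) exactly. The joint p.m.f., the helper names `expect` and `cond_mean_y`, and the competing predictor $g(x) = 1 + x$ are all hypothetical choices made for this example; they do not come from the paper.

```python
from fractions import Fraction as F

# Hypothetical joint pmf of (X, Y); the values are illustrative only.
pmf = {
    (0, 0): F(1, 4), (0, 1): F(1, 4),
    (1, 1): F(1, 8), (1, 3): F(3, 8),
}

def expect(h):
    """E[h(X, Y)] under the pmf above."""
    return sum(p * h(x, y) for (x, y), p in pmf.items())

def cond_mean_y(x0):
    """E(Y | X = x0)."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return sum(p * y for (x, y), p in pmf.items() if x == x0) / px

mean_y = expect(lambda x, y: y)
var_y = expect(lambda x, y: (y - mean_y) ** 2)
var_cond_mean = expect(lambda x, y: (cond_mean_y(x) - mean_y) ** 2)
eta_sq = var_cond_mean / var_y  # square of definition (1.1)

# Mean squared error of the conditional mean and of an arbitrary competitor.
mse_best = expect(lambda x, y: (y - cond_mean_y(x)) ** 2)
mse_other = expect(lambda x, y: (y - (1 + x)) ** 2)

print(eta_sq, mse_best, (1 - eta_sq) * var_y, mse_other)
```

For this particular p.m.f. the conditional mean attains the error $(1-\eta^2)\,\mathrm{Var}\,Y$, and the competitor does no better, in line with parts (a) and (b).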

Theorem 2.2. Let $Y$ be a random variable with $0 < \mathrm{Var}\,Y < \infty$. Let $X$ be a p-dimensional random vector jointly distributed with $Y$.

(a) Then $\max|\rho(Y,g(X))|$ occurs at $g(x) = E(Y|X = x)$, where the maximum is over all measurable functions $g: R^p \to R^1$ for which the correlation is defined.

(b) Furthermore, the maximum value is $\eta(Y;X)$.

Note that in Theorem 2.2 if $g(x)$ maximizes, then $\alpha g(x) + \beta$, $\alpha \neq 0$, maximizes. A proof of Theorem 2.2 in the case $X$ is a scalar can be found in Kotz and Soong (1977); the proof when $X$ is a vector is identical. This very interesting interpretation of $\eta(Y;X)$, according to Kruskal (1958), was first noted by Frechet (1933), (1934). Earlier, Pearson (1905) had proved $\eta(Y;X) \ge |\rho(Y,X)|$ and Fisher (1925) had shown $\eta(Y;X) \ge \max|\rho(Y,g(X))|$.

An immediate result of Theorem 2.2 is that $0 \le \eta \le 1$. From Theorem 2.1, we observe that $Y$ being predicted by $g(X)$ with an expected squared error of zero is equivalent to $\eta(Y;X) = 1$. A further consequence of Theorem 2.2 is that $\eta(Y;X) = 0$ is equivalent to $\rho(Y,h(X)) = 0$ for all measurable functions $h$ with $0 < \mathrm{Var}\,h(X) < \infty$. This also implies that $E(Y|X = x) = E(Y)$ for almost all $x$.

One commonly perceived deficiency of $\eta$ as a measure of dependence (e.g., Renyi (1959)) is that $\eta = 0$ does not imply $Y$ and $X$ are independent. The correlation ratio being zero may be interpreted as being "between" independence and uncorrelatedness (in terms of multiple correlation) in the following sense. Independence of $Y$ and $X$ is equivalent to $\rho(f(Y),g(X)) = 0$ for all suitable $f$, $g$; and uncorrelatedness is equivalent to $\rho(\alpha_1 Y+\beta_1, \alpha_2'X+\beta_2) = 0$ for all $\alpha_1$, $\beta_1$, $\alpha_2$, $\beta_2$; while zero correlation ratio, as noted, is equivalent to $\rho(\alpha_1 Y+\beta_1, g(X)) = 0$ for all $\alpha_1$, $\beta_1$, and suitable $g$.
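The gap between a zero correlation ratio and independence can be seen in a small constructed case. The sketch below assumes a hypothetical pair in which $X$ is uniform on $\{1,2\}$ and $Y = XZ$ with $Z = \pm 1$ an independent fair sign; the helper `eta_sq` is an illustrative name, not notation from the paper. Here $E(Y|X) = 0$, so $\eta(Y;X) = 0$, yet $X = |Y|$, so the two variables are far from independent and the correlation ratio in the other direction equals 1.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical example: X uniform on {1, 2}, Z = +/-1 independent of X, Y = X * Z.
pmf = {(x, x * z): F(1, 4) for x, z in product((1, 2), (-1, 1))}

def expect(h):
    """E[h(X, Y)] under the pmf above."""
    return sum(p * h(x, y) for (x, y), p in pmf.items())

def eta_sq(target, given):
    """Squared correlation ratio of target(X, Y) upon given(X, Y)."""
    mean_t = expect(target)
    var_t = expect(lambda x, y: (target(x, y) - mean_t) ** 2)
    groups = {}  # value of `given` -> (probability mass, mass-weighted sum of target)
    for (x, y), p in pmf.items():
        mass, total = groups.get(given(x, y), (0, 0))
        groups[given(x, y)] = (mass + p, total + p * target(x, y))
    var_cond_mean = sum(m * (t / m - mean_t) ** 2 for m, t in groups.values())
    return var_cond_mean / var_t

eta_sq_y_on_x = eta_sq(lambda x, y: y, lambda x, y: x)
eta_sq_x_on_y = eta_sq(lambda x, y: x, lambda x, y: y)

# Independence would require P(X=1, Y=2) = P(X=1) P(Y=2); here the left side is 0.
joint = pmf.get((1, 2), F(0))
prod_of_marginals = (expect(lambda x, y: F(int(x == 1)))
                     * expect(lambda x, y: F(int(y == 2))))

print(eta_sq_y_on_x, eta_sq_x_on_y, joint, prod_of_marginals)
```

The example also shows the directional nature of $\eta$: knowledge of $X$ does not help predict $Y$ in mean square, while knowledge of $Y$ determines $X$.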

When $(Y,X')'$ have a joint multivariate normal distribution, the multiple correlation is defined (e.g., Anderson (1958)) by $\max_{a} \rho(Y,a'X)$. For the multivariate normal, $E(Y|X)$ is linear in $X$, so that it follows from Theorems 2.1 and 2.2 that $\eta(Y;X)$ is the same as the multiple correlation of $Y$ with $X$. For the linear prediction problem, where $g$ in Theorems 2.1 and 2.2 is restricted to be linear, it is more appropriate (e.g., Parzen (1960)) to use $\max_{a} \rho(Y,a'X)$ than $\eta(Y;X)$ for measuring dependence. This suggests, perhaps, that when $X$ is an arbitrary random vector, one might adopt the nomenclature linear multiple correlation, generalized multiple correlation, and sup-multiple-correlation, respectively, for $\sup_{a} \rho(Y,a'X)$, $\eta(Y;X)$, and $\sup_{f,g} \rho(f(Y),g(X))$. For the multivariate normal these three measures are equal.

3.  A  MULTIVARIATE  CORRELATION  RATIO. 

In this section, we consider the definition of a correlation ratio of $Y$ upon $X$, where $Y$ is a $q \times 1$ random vector and $X$ is a $p \times 1$ random vector. Even though $Y$ is now a vector, it seems suitable to have the multivariate correlation ratio take scalar values, rather than attempt a multiple-valued version. To motivate our definition of the multivariate correlation ratio, we first consider the multivariate prediction problem and proceed from there.

In referring to the covariance matrix of a random vector $W$, we use the notation $\mathrm{Cov}(W)$ and $\Sigma_W$ interchangeably. The cross-covariance between random vectors $S$, $T$ is denoted by $\mathrm{Cov}(S,T) = E[(S-E(S))(T-E(T))']$. For (random) vectors, the notation $\|X\|_A^2 = X'A^{-1}X$ is employed, where $A$ is a positive definite matrix.

Theorem 3.1. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_Y < \infty$; and let $A$: $q \times q$ be positive definite. Then $\min E\|Y-g(X)\|_A^2$ occurs at $g(x) = E(Y|X = x)$, where the minimum is over all measurable $g: R^p \to R^q$ for which $E\|Y-g(X)\|_A^2 < \infty$.

Proof. The result is immediate from the well-known identity

$$E\|Y-g(X)\|_A^2 = E\|Y-E(Y|X)\|_A^2 + E\|g(X)-E(Y|X)\|_A^2.$$

Theorem 3.2. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_Y < \infty$, and let $A$: $q \times q$ be positive definite. Then

$$E\|Y-E(Y|X)\|_A^2 = \mathrm{tr}(A^{-1}\Sigma_Y)\bigl(1-\{[\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X))]/[\mathrm{tr}(A^{-1}\Sigma_Y)]\}\bigr).$$

Proof. Note $E\|Y-E(Y|X)\|_A^2 = E\|Y\|_A^2 - E\|E(Y|X)\|_A^2$. Without loss of generality, assume $EY = 0$. It is readily shown that $E\|E(Y|X)\|_A^2 = \mathrm{tr}[A^{-1}\mathrm{Cov}(E(Y|X))]$ and $E\|Y\|_A^2 = \mathrm{tr}(A^{-1}\Sigma_Y)$, so that the result follows.

Observe that without any knowledge of $X$ the best expected minimum normed predictor of $Y$ is $E(Y)$, in the sense that $\min_c E\|Y-c\|_A^2 = \mathrm{tr}(A^{-1}\Sigma_Y)$ and occurs at $c = E(Y)$. Thus, Theorem 3.2, like Theorem 2.1(b), allows the measurement of the reduction in mean normed error due to knowledge of $X$. This then suggests a suitable definition of a multivariate correlation ratio.

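Definition 3.1. Let $Y$: $q \times 1$ and $X$ be jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_Y < \infty$. The correlation ratio of $Y$ upon $X$ is given by

$$\eta_A(Y;X) = \left\{\frac{\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X))}{\mathrm{tr}(A^{-1}\Sigma_Y)}\right\}^{1/2},$$

where $A$ is a positive definite $q \times q$ matrix.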

In the case when $Y$ is a scalar this definition clearly reduces to the previous definition (1.1) of correlation ratio. Obvious possible choices of $A$ are $A = I$ and $A = \Sigma_Y$, when $\Sigma_Y$ is nonsingular. Note that $\eta_I^2(Y;X) = (\mathrm{tr}\,\mathrm{Cov}\,E(Y|X))/(\mathrm{tr}\,\Sigma_Y)$ and that $\eta_{\Sigma_Y}^2(Y;X) = q^{-1}\,\mathrm{tr}(\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2})$. In practice, the actual choice of $A$ might depend on how difficult $A$ would be to estimate from the data for a particular multivariate model. It may be the case that $\mathrm{tr}\,\Sigma_Y$ is easier to estimate than the entire matrix $\Sigma_Y$, so that $A = I$ should be chosen.

Recall that when $Y$ is a scalar, we have the important result that $\eta(Y;X) = \max_g \rho(Y,g(X))$. When $Y$ is a vector there is an analogous result; however, we must introduce a suitable version of "correlation" between two vectors.


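Definition 3.2. Let $S$: $p \times 1$ and $T$: $p \times 1$ be two jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_S < \infty$ and $0 < \mathrm{tr}\,\Sigma_T < \infty$. The vector correlation between $S$ and $T$ is defined by

$$\rho_A(S,T) = \frac{\mathrm{tr}[A^{-1}\mathrm{Cov}(S,T)]}{(\mathrm{tr}(A^{-1}\Sigma_S))^{1/2}(\mathrm{tr}(A^{-1}\Sigma_T))^{1/2}},$$

where $A$ is a $p \times p$ positive definite matrix.

Note that $\rho_A(S,T) = \rho_A(T,S)$, $|\rho_A(S,T)| \le 1$, and $\rho_A(S,-T) = -\rho_A(S,T)$. For $S$, $T$ scalars, $\rho_A(S,T) = \rho(S,T)$.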


Theorem 3.3. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_Y < \infty$; and let $A$: $q \times q$ be positive definite. Then $\max|\rho_A(Y,g(X))| = \eta_A(Y;X)$, where the maximum is taken over all measurable functions $g: R^p \to R^q$ for which the vector correlation is defined.

Proof. Without loss of generality, assume $E(Y) = 0$ and $Eg(X) = 0$. From Theorem 3.1, it follows that

$$E\|Y-E(Y|X)\|_A^2 \le E\|Y-c_g\,g(X)\|_A^2, \tag{3.1}$$

for all $g$ and constants $c_g$, which can depend on $g$. Expand (3.1) and choose $c_g = \{[\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X))]/[\mathrm{tr}(A^{-1}\mathrm{Cov}\,g(X))]\}^{1/2}$ to obtain

$$\mathrm{tr}\,\mathrm{Cov}(A^{-1}Y,E(Y|X)) \ge c_g\,\mathrm{tr}\,\mathrm{Cov}(A^{-1}Y,g(X)). \tag{3.2}$$

Divide both sides of (3.2) by $[(\mathrm{tr}(A^{-1}\Sigma_Y))(\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X)))]^{1/2}$ to yield

$$\rho_A(Y,E(Y|X)) \ge \rho_A(Y,g(X)), \tag{3.3}$$

for all measurable $g$. Because (3.3) holds for $g$ and $-g$, and because $\rho_A(S,-T) = -\rho_A(S,T)$, (3.3) holds with the right hand side of (3.3) in absolute values. Now

$$\rho_A(Y,E(Y|X)) = \frac{\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X))}{(\mathrm{tr}(A^{-1}\Sigma_Y))^{1/2}[\mathrm{tr}(A^{-1}\mathrm{Cov}\,E(Y|X))]^{1/2}} = \frac{[E\|E(Y|X)\|_A^2]^{1/2}}{(\mathrm{tr}(A^{-1}\Sigma_Y))^{1/2}} = \eta_A(Y;X). \tag{3.4}$$

The result now follows from (3.3) and (3.4).

Corollary 3.1. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $0 < \mathrm{tr}\,\Sigma_Y < \infty$; and let $A$: $q \times q$ be positive definite. Then $\eta_A(Y;X) = \rho_A(Y,E(Y|X))$.

Proof. This is the result of (3.4).

Observe that if $E(Y|X)$ is linear in $X$, then $\eta_A(Y;X) = \max|\rho_A(Y,BX)|$, where the maximum is taken over all matrices $B$: $q \times p$.
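Definition 3.1, Definition 3.2 and Theorem 3.3 can be checked together on a small case with $A = I$. The sketch below assumes a hypothetical joint p.m.f. for a scalar $X$ and a two-dimensional $Y$; the names `trace_cov`, `rho_sq` and the competing predictor $g(x) = (x,x)'$ are illustrative choices. It works with squared quantities so that exact rational arithmetic can be used.

```python
from fractions import Fraction as F

# Hypothetical joint pmf: keys are (x, (y1, y2)); probabilities are illustrative.
pmf = {
    (0, (0, 0)): F(1, 4), (0, (1, 0)): F(1, 4),
    (1, (1, 2)): F(1, 4), (1, (2, 1)): F(1, 4),
}
Q = 2  # dimension of Y

def expect(h):
    """E[h(X, Y)] under the pmf above."""
    return sum(p * h(x, y) for (x, y), p in pmf.items())

def cond_mean(x0):
    """E(Y | X = x0) as a tuple of length Q."""
    px = sum(p for (x, _), p in pmf.items() if x == x0)
    return tuple(
        sum(p * y[i] for (x, y), p in pmf.items() if x == x0) / px
        for i in range(Q)
    )

def trace_cov(s, t):
    """tr Cov(S, T) for vector-valued functions s(x, y) and t(x, y)."""
    total = 0
    for i in range(Q):
        ms = expect(lambda x, y: s(x, y)[i])
        mt = expect(lambda x, y: t(x, y)[i])
        total += expect(lambda x, y: (s(x, y)[i] - ms) * (t(x, y)[i] - mt))
    return total

def rho_sq(s, t):
    """Squared vector correlation of Definition 3.2 with A = I."""
    return trace_cov(s, t) ** 2 / (trace_cov(s, s) * trace_cov(t, t))

def Y_(x, y):
    return y

def M_(x, y):  # the conditional mean E(Y | X)
    return cond_mean(x)

def G_(x, y):  # an arbitrary competing predictor g(X) = (X, X)'
    return (x, x)

eta_sq = trace_cov(M_, M_) / trace_cov(Y_, Y_)  # Definition 3.1 with A = I, squared
print(eta_sq, rho_sq(Y_, M_), rho_sq(Y_, G_))
```

For this p.m.f. the squared vector correlation between $Y$ and $E(Y|X)$ coincides with $\eta_I^2(Y;X)$, as in Corollary 3.1, and the competing predictor gives a smaller value.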

Before considering a number of examples, we briefly discuss further properties of the vector correlation coefficient when $A = I$. It is easily shown that

$$\rho_I(S,T) = \sum_{i=1}^{p} (\alpha_i\beta_i)^{1/2}\,\rho(S_i,T_i),$$

where $S = (S_1,\ldots,S_p)'$, $T = (T_1,\ldots,T_p)'$; and $\alpha_i = \mathrm{Var}\,S_i/(\sum_j \mathrm{Var}\,S_j)$, $\beta_i = \mathrm{Var}\,T_i/(\sum_j \mathrm{Var}\,T_j)$. Note $0 \le \alpha_i, \beta_i \le 1$ and $\sum \alpha_i = \sum \beta_i = 1$; thus, $\rho_I(S,T)$ can almost be viewed as a weighted average of the pairwise correlations. If $\mathrm{Var}\,S_1 = \ldots = \mathrm{Var}\,S_p$ and $\mathrm{Var}\,T_1 = \ldots = \mathrm{Var}\,T_p$, then $\rho_I(S,T) = (\sum_{i=1}^{p} \rho(S_i,T_i))/p$. If $\mathrm{Var}\,S_1 \gg \mathrm{Var}\,S_2 = \ldots = \mathrm{Var}\,S_p$ and $\mathrm{Var}\,T_1 \gg \mathrm{Var}\,T_2 = \ldots = \mathrm{Var}\,T_p$, then $\rho_I(S,T) \approx \rho(S_1,T_1)$. Let $S^* = \Gamma S + a$, $T^* = \Gamma T + b$, where $\Gamma$ is a $p \times p$ orthonormal matrix, and $a$, $b$ are arbitrary vectors; then $\rho_I(S^*,T^*) = \rho_I(S,T)$.
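The weighted-sum representation of $\rho_I(S,T)$ can be confirmed numerically. The following sketch assumes a hypothetical population of four equally likely outcomes of a pair of two-dimensional vectors $(S,T)$; the data and the helper names `mean` and `cov` are illustrative only. It computes $\rho_I$ once from Definition 3.2 with $A = I$ and once from the weights $\alpha_i$, $\beta_i$ and the pairwise correlations.

```python
import math

# Hypothetical population: four equally likely outcomes of (S, T), each 2-dimensional.
outcomes = [((0, 0), (1, 0)), ((1, 2), (1, 1)), ((2, 1), (3, 0)), ((3, 5), (2, 4))]
P = 2

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Population covariance of two equally weighted lists."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

s_cols = [[s[i] for s, _ in outcomes] for i in range(P)]
t_cols = [[t[i] for _, t in outcomes] for i in range(P)]

# Direct evaluation: tr Cov(S,T) / {(tr Sigma_S)(tr Sigma_T)}^(1/2).
tr_ss = sum(cov(c, c) for c in s_cols)
tr_tt = sum(cov(c, c) for c in t_cols)
tr_st = sum(cov(s_cols[i], t_cols[i]) for i in range(P))
rho_direct = tr_st / math.sqrt(tr_ss * tr_tt)

# Weighted-sum evaluation: sum_i (alpha_i * beta_i)^(1/2) * rho(S_i, T_i).
rho_weighted = 0.0
for i in range(P):
    var_s, var_t = cov(s_cols[i], s_cols[i]), cov(t_cols[i], t_cols[i])
    alpha, beta = var_s / tr_ss, var_t / tr_tt
    pairwise = cov(s_cols[i], t_cols[i]) / math.sqrt(var_s * var_t)
    rho_weighted += math.sqrt(alpha * beta) * pairwise

print(rho_direct, rho_weighted)
```

The two evaluations agree up to floating-point rounding, which reflects that the representation is an algebraic identity rather than an approximation.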

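Example 3.1. Multivariate Normal.

Let $(X',Y')' \sim N(0,\Sigma)$, where $X$ is $q \times 1$, $Y$ is $p \times 1$ and $\Sigma$ is partitioned similarly, i.e.,

$$\Sigma = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}' & \Sigma_Y \end{pmatrix}. \tag{3.5}$$

Then $E(X|Y) = \Sigma_{XY}\Sigma_Y^{-1}Y$ and $\mathrm{Cov}(E(X|Y)) = \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}'$. Hence,

$$\eta_I(X;Y) = \{\mathrm{tr}(\Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}')\}^{1/2}(\mathrm{tr}\,\Sigma_X)^{-1/2}$$

and

$$\eta_{\Sigma_X}(X;Y) = \{q^{-1}\,\mathrm{tr}(\Sigma_X^{-1/2}\Sigma_{XY}\Sigma_Y^{-1}\Sigma_{XY}'\Sigma_X^{-1/2})\}^{1/2}.$$

Observe that $\eta_{\Sigma_X}(X;Y) = \{q^{-1}\sum_{i=1}^{q} C_i^2(X,Y)\}^{1/2}$, where $C_i(X,Y)$ is the $i$th canonical correlation between $X$ and $Y$, $i = 1,\ldots,q$. If $p = q$, then $\rho_I(X,Y) = (\mathrm{tr}\,\Sigma_{XY})\{(\mathrm{tr}\,\Sigma_X)(\mathrm{tr}\,\Sigma_Y)\}^{-1/2}$ and $\rho_{\Sigma_X}(X,Y) = \mathrm{tr}(\Sigma_X^{-1}\Sigma_{XY})\{p\,\mathrm{tr}(\Sigma_X^{-1}\Sigma_Y)\}^{-1/2}$.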

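Example 3.2. Elliptically Symmetric Distributions.

Let $Z = (X',Y')'$ have p.d.f. $c|\Psi|^{-1/2}\phi(z'\Psi^{-1}z)$. (Such distributions are called elliptically symmetric, e.g., Kelker (1970).) Let $X$ be $q \times 1$, $Y$ be $p \times 1$ and suppose $\Psi$ is partitioned as in (3.5), where we assume $0 < \mathrm{tr}\,\mathrm{Cov}(X) < \infty$. Then $E(X|Y) = \Psi_{XY}\Psi_Y^{-1}Y$ and $\mathrm{Cov}(E(X|Y)) = a\,\Psi_{XY}\Psi_Y^{-1}\Psi_{XY}'$; also $\mathrm{Cov}(X) = a\,\Psi_X$, so that

$$\eta_I(X;Y) = \{\mathrm{tr}(\Psi_{XY}\Psi_Y^{-1}\Psi_{XY}')\}^{1/2}(\mathrm{tr}\,\Psi_X)^{-1/2}$$

and

$$\eta_{\Sigma_X}(X;Y) = \{q^{-1}\,\mathrm{tr}(\Psi_X^{-1/2}\Psi_{XY}\Psi_Y^{-1}\Psi_{XY}'\Psi_X^{-1/2})\}^{1/2}.$$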


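Example 3.3. Multivariate Farlie-Gumbel-Morgenstern (FGM) Distribution.

In order to have tractable results, we consider a special four-dimensional FGM distribution with uniform marginals. (See Johnson and Kotz (1975) for the general multivariate distribution.) Let

$$f_{X_1,X_2,X_3,X_4}(x_1,x_2,x_3,x_4) = 1 + \alpha_{12}v_{12} + \alpha_{13}v_{13} + \alpha_{14}v_{14} + \alpha_{23}v_{23} + \alpha_{24}v_{24} + \alpha_{123}v_{123} + \alpha_{124}v_{124} + \alpha_{1234}v_{1234}, \tag{3.6}$$

where $v_{ij} = (1-2x_i)(1-2x_j)$, $v_{ijk} = (1-2x_i)(1-2x_j)(1-2x_k)$, and $v_{ijkl} = (1-2x_i)(1-2x_j)(1-2x_k)(1-2x_l)$. Direct calculation yields that $E(X_1|X_3,X_4) = (\tfrac{1}{2} - \alpha_{13}/6 - \alpha_{14}/6) + (\alpha_{13}/3)X_3 + (\alpha_{14}/3)X_4$, and $E(X_2|X_3,X_4) = (\tfrac{1}{2} - \alpha_{23}/6 - \alpha_{24}/6) + (\alpha_{23}/3)X_3 + (\alpha_{24}/3)X_4$. Under (3.6), $X_3$ and $X_4$ are independent uniform variables, each with variance $1/12$, and $\mathrm{Cov}(X_1,X_2) = \alpha_{12}/36$. Further calculations therefore yield that

$$\mathrm{Cov}(E((X_1,X_2)'|(X_3,X_4)')) = \frac{1}{108}\begin{pmatrix} \alpha_{13}^2 + \alpha_{14}^2 & \alpha_{13}\alpha_{23} + \alpha_{14}\alpha_{24} \\ \alpha_{13}\alpha_{23} + \alpha_{14}\alpha_{24} & \alpha_{23}^2 + \alpha_{24}^2 \end{pmatrix} \tag{3.7}$$

and

$$\mathrm{Cov}((X_1,X_2)') \equiv \Sigma_{12} = \frac{1}{36}\begin{pmatrix} 3 & \alpha_{12} \\ \alpha_{12} & 3 \end{pmatrix}. \tag{3.8}$$

Hence, for this example, since $\mathrm{tr}\,\Sigma_{12} = 1/6$,

$$\eta_I((X_1,X_2)';(X_3,X_4)') = \{(1/18)(\alpha_{13}^2 + \alpha_{14}^2 + \alpha_{23}^2 + \alpha_{24}^2)\}^{1/2}$$

and

$$\eta_{\Sigma_{12}}((X_1,X_2)';(X_3,X_4)') = \{(1/6)(9-\alpha_{12}^2)^{-1}[3(\alpha_{13}^2 + \alpha_{14}^2 + \alpha_{23}^2 + \alpha_{24}^2) - 2(\alpha_{12}\alpha_{13}\alpha_{23} + \alpha_{12}\alpha_{14}\alpha_{24})]\}^{1/2},$$

where $\Sigma_{12}$ is given by (3.8).

Suppose that $f$ is given by (3.6), where $\alpha_{123} = \alpha_{124} = \alpha_{12} = 0$. By symmetry, $\eta_I((X_3,X_4)';(X_1,X_2)') = \{(1/18)(\alpha_{13}^2 + \alpha_{14}^2 + \alpha_{23}^2 + \alpha_{24}^2)\}^{1/2}$. Thus, if we set $\alpha_{13} = \alpha_{14} = \alpha_{23} = \alpha_{24} = 0$, we have that $(X_1,X_2)'$ and $(X_3,X_4)'$ mutually have zero correlation ratio, each upon the other; yet $(X_1,X_2)'$ and $(X_3,X_4)'$ are not independent vectors. Very roughly speaking, $\alpha_{1234}$ measures a "higher order" level of dependence that is not measured in this example by $\eta_I$ and $\eta_{\Sigma_{12}}$, where $\Sigma_{12}$ is given by (3.8).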

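Example 3.4. Multinomial.

Let $X = (X_1,\ldots,X_I)'$ have p.m.f.

$$p(x) = \frac{n!}{\prod_{i=1}^{I} x_i!\,(n-\sum x_i)!}\,\prod_{i=1}^{I} p_i^{x_i}\,\bigl(1-\sum p_i\bigr)^{n-\sum x_i}.$$

Then $E((X_1,\ldots,X_J)'|(X_{J+1},\ldots,X_I)') = n^*(p_1^*,\ldots,p_J^*)'$, where $n^* = n-(X_{J+1}+\ldots+X_I)$, and $p_i^* = p_i/(1-(p_{J+1}+\ldots+p_I))$, $i = 1,\ldots,J$. It follows that the conditional covariance matrix is $\mathrm{Cov}((X_1,\ldots,X_J)'|(X_{J+1},\ldots,X_I)') \equiv \Sigma_{1\cdot 2}$, where

$$\Sigma_{1\cdot 2} = \{n^*(p_i^*\delta_{ij} - p_i^*p_j^*)\}, \tag{3.9}$$

with $\delta_{ij} = 1$ if $i = j$, and $= 0$ otherwise. Now $\mathrm{Cov}(X_1,\ldots,X_J)' \equiv \Sigma_1$ is of the same form as (3.9), with $n^*$, $p_i^*$ replaced by $n$, $p_i$. Since $\mathrm{Cov}\,E((X_1,\ldots,X_J)'|(X_{J+1},\ldots,X_I)') = \Sigma_1 - E(\Sigma_{1\cdot 2})$ and $E(n^*) = n\pi$ with $\pi = 1-(p_{J+1}+\ldots+p_I)$, it follows that

$$\eta_I^2((X_1,\ldots,X_J)';(X_{J+1},\ldots,X_I)') = 1 - \pi\,\Bigl(\sum_{i=1}^{J} p_i^*(1-p_i^*)\Bigr)\Big/\Bigl(\sum_{i=1}^{J} p_i(1-p_i)\Bigr).$$

Observe that $(\mathrm{Cov}(X_1,\ldots,X_J)')^{-1} = \{n^{-1}(p_i^{-1}\delta_{ij} + (1-\sum_{k=1}^{J} p_k)^{-1})\}$, so that

$$\eta_{\Sigma_1}^2((X_1,\ldots,X_J)';(X_{J+1},\ldots,X_I)') = 1 - (\pi/J)\sum_{i=1}^{J+1} (1-p_i^*)(p_i^*/p_i),$$

where for $i = J+1$, $p_{J+1}^* = 1-(p_1^*+\ldots+p_J^*)$ and $p_{J+1} = 1-(p_1+\ldots+p_J)$.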

4. MOST PREDICTABLE LINEAR FUNCTIONS.

The multivariate correlation ratio $\eta_A(Y;X)$ measures the amount of predictability $X$ has for $Y$ for any jointly distributed random vectors $X$ and $Y$. This relational notion is, of course, directional in that we are using the information in $X$ in order to predict $Y$. To further understand this predictive relationship, it is natural to attempt to find what information in $Y$, that is, what function of $Y$, is most predictable from $X$. If we allow the consideration of all suitable measurable functions of $Y$, and follow the minimum norm approach with $A = I$, we are, in essence, evaluating the sup-correlation $\rho'(X,Y)$ and finding the appropriate maximizing functions. As noted previously, this often is a difficult task mathematically. Consequently, we proceed analogously to the theory of principal components and canonical correlations, and restrict attention to linear functions of $Y$. The goal then is to find which linear functions of $Y$ are most predictable from $X$ and to measure this degree of predictability.

For simplicity, we assume in the following that $\Sigma_Y$ is nonsingular; however, this could be avoided by examining determinantal equations.


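Theorem 4.1. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $\mathrm{tr}\,\Sigma_Y < \infty$ and $\Sigma_Y$ nonsingular. Then

$$\max_{\beta}\ \eta^2(\beta'Y;X) = \lambda_{\max}(\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2}), \tag{4.1}$$

where $\lambda_{\max}$ denotes the largest eigenvalue. Furthermore, this max occurs at $\beta = c\,\Sigma_Y^{-1/2}e_{\max}$, where $c$ is any nonzero constant, and $e_{\max}$ is an eigenvector corresponding to $\lambda_{\max}$.

Proof.

$$\max_{\beta}\ \eta^2(\beta'Y;X) = \max_{\beta}\ \frac{\beta'\,\mathrm{Cov}\,E(Y|X)\,\beta}{\beta'\Sigma_Y\beta} = \max_{\gamma}\ \frac{\gamma'\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2}\gamma}{\gamma'\gamma},$$

where $\gamma = \Sigma_Y^{1/2}\beta$, so that the result is immediate (e.g., Rao (1973, p. 62)).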
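For $q = 2$ the eigenvalue in (4.1) can be computed in closed form, because $\lambda_{\max}(\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2})$ is the largest root of $|\mathrm{Cov}\,E(Y|X) - \lambda\Sigma_Y| = 0$, a quadratic in $\lambda$. The sketch below assumes two hypothetical $2 \times 2$ matrices `C` (standing for $\mathrm{Cov}\,E(Y|X)$) and `S` (standing for $\Sigma_Y$) and compares the root with a brute-force search over directions $\beta$; the matrices, the function names and the grid size are illustrative assumptions.

```python
import math

def lambda_max_2x2(c, s):
    """Largest root of det(C - lam * S) = 0 for symmetric 2x2 C and positive definite S."""
    a = s[0][0] * s[1][1] - s[0][1] ** 2
    b = -(s[0][0] * c[1][1] + s[1][1] * c[0][0] - 2 * s[0][1] * c[0][1])
    d = c[0][0] * c[1][1] - c[0][1] ** 2
    disc = max(b * b - 4 * a * d, 0.0)  # guard against tiny negative rounding
    return (-b + math.sqrt(disc)) / (2 * a)

def ratio(beta, c, s):
    """beta' C beta / beta' S beta, i.e. eta^2(beta'Y; X) in the proof of Theorem 4.1."""
    def quad(m):
        return (beta[0] ** 2 * m[0][0] + 2 * beta[0] * beta[1] * m[0][1]
                + beta[1] ** 2 * m[1][1])
    return quad(c) / quad(s)

# Hypothetical inputs: C plays the role of Cov E(Y|X), S the role of Sigma_Y.
C = [[1 / 4, 3 / 8], [3 / 8, 9 / 16]]
S = [[1 / 2, 1 / 4], [1 / 4, 11 / 16]]

lam = lambda_max_2x2(C, S)

# Brute-force check: scan directions beta = (cos t, sin t) on a fine grid.
N = 3600
best = max(
    ratio((math.cos(k * math.pi / N), math.sin(k * math.pi / N)), C, S)
    for k in range(N)
)
print(lam, best)
```

The grid maximum approaches the closed-form root from below, which is the behaviour the Rayleigh-quotient argument in the proof predicts.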


An immediate implication of Theorem 4.1 is the following corollary, whose proof is obvious.

Corollary 4.1. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $\mathrm{tr}\,\Sigma_Y < \infty$ and $\Sigma_Y$ nonsingular. Then

$$\max_{\beta:\ q \times 1,\ \ g:\ R^p \to R^1}\ \rho^2(\beta'Y,g(X)) = \lambda_{\max}(\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2}),$$

and occurs at $\beta = c\,\Sigma_Y^{-1/2}e_{\max}$ and $g(x) = c\,e_{\max}'\Sigma_Y^{-1/2}E(Y|X = x)$.

The result of Corollary 4.1 is in some sense "half way" between linear canonical correlation for arbitrary random vectors and the sup-correlation of Sarmanov and Zaharov (1960).

For convenience, we introduce the following definition:

Definition 4.1. $L_i(Y;X) = \{\lambda_i(\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2})\}^{1/2}$, where $\lambda_i$ denotes the $i$th largest eigenvalue.

When $Y$: $q \times 1$ and $X$: $p \times 1$ have a joint multivariate normal distribution, $L_1(Y;X) = C_1(Y,X)$, the first canonical correlation. For Example 3.3, $L_1^2((X_1,X_2)';(X_3,X_4)')$ is the largest root of the determinantal equation $|\Sigma_{12\cdot 34} - \lambda\Sigma_{12}| = 0$, where $\Sigma_{12\cdot 34}$ is given by (3.7) and $\Sigma_{12}$ by (3.8).

As in canonical correlation theory, multivariate versions of Theorem 4.1 could be obtained by examining uncorrelated iterations of (4.1). Another approach, which we follow, is to employ $\eta_A$. Suppose $Y$ is $q \times 1$ and $X$ is $p \times 1$; let $B$ be an $r \times q$ matrix and $A$ be an $r \times r$ positive definite matrix. Consider maximizing $\eta_A(BY;X)$ over $B$ satisfying $\mathrm{Cov}(BY) = A$. When $A = I$, this is equivalent to uncorrelatedness (with unit variances) among the entries of $BY$.

Theorem 4.2. Let $Y$: $q \times 1$, $X$: $p \times 1$ be jointly distributed random vectors with $\mathrm{tr}\,\Sigma_Y < \infty$ and $\Sigma_Y$ nonsingular; and let $A$: $r \times r$ be positive definite. Then

$$\max_{B:\ r \times q,\ \mathrm{Cov}(BY) = A}\ \eta_A(BY;X) = \Bigl\{r^{-1}\sum_{i=1}^{r} L_i^2(Y;X)\Bigr\}^{1/2}.$$

Proof. Observe that

$$\max_B\ \eta_A^2(BY;X) = \max_B\ \mathrm{tr}(A^{-1}\mathrm{Cov}\,E(BY|X))/(\mathrm{tr}\,A^{-1}\Sigma_{BY})$$

$$= \max_B\ \frac{\mathrm{tr}(A^{-1/2}B\,\mathrm{Cov}\,E(Y|X)\,B'A^{-1/2})}{\mathrm{tr}(A^{-1/2}B\,\Sigma_Y B'A^{-1/2})}$$

$$= r^{-1}\max_{C:\ CC' = I}\ \mathrm{tr}\,C\,\Sigma_Y^{-1/2}\,\mathrm{Cov}\,E(Y|X)\,\Sigma_Y^{-1/2}\,C',$$

where $C = A^{-1/2}B\,\Sigma_Y^{1/2}$. The result follows immediately (e.g., Rao (1973, p. 63)).

For $X$, $Y$ having a multivariate normal distribution, $\max_B \eta_A^2(BY;X) = r^{-1}\sum_{i=1}^{r} C_i^2(Y,X)$, i.e., the average of the squares of the first $r$ canonical correlations. An actual maximizing $B$ in Theorem 4.2 is given by

$$B = A^{1/2}(e_1,\ldots,e_r)'\,\Sigma_Y^{-1/2},$$

where $e_i$ is an eigenvector corresponding to $L_i(Y;X)$ and $e_1,\ldots,e_r$ are orthonormal vectors.

The $q$ values, $\{j^{-1}\sum_{i=1}^{j} L_i^2(Y;X)\}^{1/2}$, $j = 1,\ldots,q$, could themselves be viewed as measures of dependence of $Y$ upon $X$ and their properties explored. Note that knowing these $q$ values is equivalent to knowing the $q$ eigenvalues, $L_i^2(Y;X)$, $i = 1,\ldots,q$.

For a fixed $A$, the estimation of $\eta_A(Y;X)$ based upon $n$ observations would be of interest, as would the estimation of $L_i(Y;X)$. Both the actual estimation techniques employed and the resultant distribution theory would be dependent upon the underlying model assumptions.